Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

Authors

Abstract

Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has been recently proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from the overfitting issue in two aspects: (i) the test accuracy on base classes first improves and then worsens during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate such overfitting problems. In this study, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages, respectively, leading to the non-overfitting and overfitting phenomena. Given those observations, we propose Subspace Prompt Tuning (SubPT), which projects the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process, and successfully eliminates the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set, needless of image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boost the performance of CoOp and outperform the state-of-the-art CoCoOp approach. Experiments on more challenging vision downstream tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Code can be found at https://tinyurl.com/mpe64f89.
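The abstract's key mechanism is a gradient projection: the gradient flow of the learnable context vectors is recorded during the early training stage, its leading eigenvectors define a low-rank subspace, and later gradients are projected onto that subspace before each update. The sketch below is a minimal illustration of this idea in PyTorch, not the authors' released implementation (see the linked code). The prompt tensor ctx_vectors, the placeholder quadratic loss, and the hyperparameters collect_steps and k are assumptions made for the example; in the actual method the gradients would come from the CLIP prompt-tuning loss on base-class training data.

    import torch

    def top_k_gradient_subspace(grad_history, k):
        """Stack early-stage gradients (flattened) and return the top-k
        right singular vectors, which span the low-rank subspace."""
        G = torch.stack([g.flatten() for g in grad_history])  # (steps, dim)
        _, _, Vh = torch.linalg.svd(G, full_matrices=False)
        return Vh[:k]                                          # (k, dim)

    def project_gradient(grad, basis):
        """Project a gradient onto the subspace spanned by the rows of `basis`."""
        flat = grad.flatten()
        coeffs = basis @ flat                  # coordinates in the subspace
        return (basis.T @ coeffs).view_as(grad)

    # Illustrative prompt-tuning loop (hypothetical shapes and loss).
    ctx_vectors = torch.randn(16, 512, requires_grad=True)    # learnable context
    optimizer = torch.optim.SGD([ctx_vectors], lr=2e-3)

    grad_history, basis = [], None
    collect_steps, k = 100, 32                 # assumed early-stage budget / rank

    for step in range(1000):
        loss = (ctx_vectors ** 2).mean()       # stand-in for the CLIP loss
        optimizer.zero_grad()
        loss.backward()

        if step < collect_steps:
            # Early stage: record the gradient flow.
            grad_history.append(ctx_vectors.grad.detach().clone())
            if step == collect_steps - 1:
                basis = top_k_gradient_subspace(grad_history, k)
        elif basis is not None:
            # Later stages: keep only the component lying in the early-stage
            # gradient subspace before the optimizer step.
            ctx_vectors.grad = project_gradient(ctx_vectors.grad, basis)

        optimizer.step()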


Similar articles

Contextual Information and Specific Language Models for Spoken Language Understanding

In this paper we explain how contextual expectations are generated and used in the task-oriented spoken language understanding system Dialogos. The hard task of recognizing spontaneous speech on the telephone may greatly benefit from the use of specific language models during the recognition of callers' utterances. By 'specific language models' we mean a set of language models that are trained ...


Language understanding using hidden understanding models

We describe the first sentence understanding system that is completely based on learned methods, both for understanding individual sentences and for determining their meaning in the context of preceding sentences. We describe the models used for each of the three stages in the understanding: semantic parsing, semantic classification, and discourse modeling. When we ran this system on the last test (Decembe...


Nonparametric Bayesian Models for Spoken Language Understanding

In this paper, we propose a new generative approach for the semantic slot filling task in spoken language understanding using a nonparametric Bayesian formalism. Slot filling is typically formulated as a sequential labeling problem, which does not directly deal with the posterior distribution of possible slot values. We present a nonparametric Bayesian model involving the generation of arbitrary na...


Fertility Models for Statistical Natural Language Understanding

Several recent efforts in statistical natural language understanding (NLU) have focused on generating clumps of English words from semantic meaning concepts (Miller et al., 1995; Levin and Pieraccini, 1995; Epstein et al., 1996; Epstein, 1996). This paper extends the IBM Machine Translation Group's concept of fertility (Brown et al., 1993) to the generation of clumps for natural language unders...


Learning Finite-State Models For Language Understanding

Language understanding in limited domains is here approached as a problem of language translation in which the target language is a formal language rather than a natural one. Finite-state transducers are used to model the translation process. Furthermore, these models are automatically learned from training data consisting of pairs of natural-language/formal-language sentences. The need for train...



Journal

Journal title: IEEE Transactions on Circuits and Systems for Video Technology

Year: 2023

ISSN: 1051-8215, 1558-2205

DOI: https://doi.org/10.1109/tcsvt.2023.3245584